Predicting Home Sale Price in Philadelphia

MUSA 508 Midterm

Author

Akira Di Sandro and Nissim Lebovits

Published

October 11, 2023

Summary

Introduction

Housing market price estimators like Zillow’s Zestimate can give a general idea of a house’s value, but professional appraisers and real estate agents do not recommend relying on these tools because they tend to have high error rates. Beyond inconveniencing individual home buyers and sellers, the high error rates of automated valuation models (AVMs, e.g. Zillow’s Zestimate) can misrepresent low-income and majority-minority neighborhoods, especially Black-majority neighborhoods, where home price errors are higher than in majority-white neighborhoods (see https://www.urban.org/sites/default/files/publication/103429/how-automated-valuation-models-can-disproportionately-affect-majority-black-neighborhoods_1.pdf). A home’s value is also often used to determine whether a homeowner is approved for a mortgage loan and to inform policymakers about the general state of the housing market (see https://www.iaao.org/media/standards/Standard_on_Automated_Valuation_Models.pdf). Inaccurate housing estimates may therefore lead to misinformed housing and economic policies that disproportionately affect non-white neighborhoods, reinforcing racial and class disparities in housing.

There are many reasons why these price estimators may be inaccurate (e.g. limited data availability, inability to capture temporal housing market patterns), and among them, a lack of local knowledge stands out. Each urban area has its own culture and unique characteristics that must be factored into an AVM to predict housing prices accurately. As residents and spatial analysts of the city of Philadelphia, our team was able to apply our knowledge of the city and its neighborhoods to the creation of an improved prediction model.

Even with a deeper understanding of Philadelphia, creating an improved home price estimator for the city is still a challenging task. In addition to the traditional housing market and attribute data used in AVMs, all kinds of data are available, including health and human services, transportation, public safety, parks and recreation, and food access. Although all of these data are interrelated and inform both each other and home price, limiting the set of predictors to those that are most informative can be a daunting task. We often create new features by redefining (and sometimes combining) variables to improve our prediction of home price. Fortunately, automated variable-selection tools make this decision-making process easier.

Another possible challenge in constructing a home price estimator is data availability, or rather unavailability. In some neighborhoods, especially low-income areas and places where residents distrust the government, the data we use in AVMs may be missing for various reasons. Some of the variables most informative to a model may describe resources that are simply unavailable in low-income neighborhoods. On the other hand, even when resources are available and used by the community, residents of an area with high distrust of government may not respond to the Census and other surveys, resulting in missing data. Ultimately, if a model relies on variables that are missing in neighborhoods with certain characteristics, it will predict home price with more error in those areas.

Keeping these potential pitfalls in mind, our overall modeling strategy was as follows. We first gathered open-source data about Philadelphia to examine and create variables about Philadelphia’s houses and neighborhoods that could serve as predictors in our model. We then fed a selection of predictors into an ordinary least squares (OLS) regression, specifically a stepwise OLS regression, to narrow down the combination of most informative predictors. After assessing model fit with mean absolute error (MAE) and mean absolute percentage error (MAPE), we revised our selection of predictors and repeated the process until we were satisfied with our model.
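The fit metrics above are straightforward to compute from a model’s predictions. A minimal sketch in base R, using made-up toy vectors in place of real model output:

```r
# Toy illustration of the two fit metrics used to assess the model
# (values are invented for demonstration, not from the actual model).
actual    <- c(150000, 220000, 310000)
predicted <- c(160000, 200000, 340000)

mae  <- mean(abs(predicted - actual))           # mean absolute error, in dollars
mape <- mean(abs(predicted - actual) / actual)  # mean absolute percentage error
```

In the report itself, `predicted` would come from `predict()` on the fitted stepwise OLS model.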

Our newly improved model uses xx independent variables to predict housing prices in Philadelphia and does so with an MAE of xx and a MAPE of xx. The most informative predictors in our OLS regression are var1, var2, and var3. We observed clustering in home price predictions, with a Moran’s I of 0.615 (p < 0.001), meaning __.
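The Moran’s I statistic mentioned above measures spatial autocorrelation: values near 1 indicate clustering of similar values, values near 0 a spatially random pattern. Our analysis computes it with the spdep/sfdep packages; a from-scratch base-R sketch of the statistic, on a toy three-observation example (the values and weights matrix here are illustrative only):

```r
# Moran's I from first principles: x is a numeric vector of observations,
# W a binary spatial weights matrix saying which observations neighbor which.
morans_i <- function(x, W) {
  z <- x - mean(x)                                # deviations from the mean
  n <- length(x)
  (n / sum(W)) * as.numeric(t(z) %*% W %*% z) / sum(z^2)
}

x <- c(10, 12, 30)                                # toy attribute values
W <- matrix(c(0, 1, 0,
              1, 0, 1,
              0, 1, 0), nrow = 3, byrow = TRUE)   # obs 2 neighbors obs 1 and 3
morans_i(x, W)
</imports>
```
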

Show the code
library(tidyverse)
library(olsrr)
library(sf)
library(caret) # add dummy vars
library(tmap)
library(fastDummies)
library(tidycensus)
library(spdep)
library(sfdep)
library(curl)
library(zip)
library(rsgeo)
library(janitor)
library(spatstat)
library(maptools)
library(terra)
library(ggthemr)
library(ggcorrplot)
library(missForest)
library(psych)
library(kableExtra)
library(gridExtra)

tmap_mode('view')
options(tigris_use_cache = TRUE, scipen = 999)
ggthemr('flat')

crs <- "epsg:2272"

set.seed(42)

# load functions from functions.R
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")

Methods

Data Sources

We gathered data about the shape and outline of the city of Philadelphia and its neighborhoods from OpenData ArcGIS and Azavea’s GitHub repository (Azavea, geo-data, 2012, GitHub repository, https://raw.githubusercontent.com/azavea/geo-data/master/Neighborhoods_Philadelphia/), respectively. We also pulled information about commercial centers from OpenData ArcGIS. Housing market and attribute data (including house features, sale price, ownership, etc.) were provided by Zillow (aka studentData), and additional census tract data (including total population, median income, etc.) were gathered from the 5-year ACS for {year}. We note here that we had previously gathered data from {source1} and {source2} on tree canopy cover and shootings, respectively, but our exploratory data analysis found that these data did not help us improve our model.

Data Wrangling

Exploratory Analysis

We first limited studentData to rows with realistic values for sale price (sale_price > 1). After combining all of the data described above (matching by geographic location), we explored the relationship between each variable and sale price using scatterplots (like those in Figure xx) that plot sale price as a function of predictors of interest.

Feature Engineering

We observed some impossible or missing values in our data: values of zero or one for total_area and total_livable_area, missing values for the spatial lag of price, zeros for numbers of rooms, bedrooms, and bathrooms, and zeros for year_built. Although we knew why some of these values were missing (e.g. the spatial lag of price was missing wherever a challenge-set house was one of the 15 nearest neighbors, since we set sale price to missing for those houses in this calculation), we had to get creative for the other cases. Our solutions were to (1) replace impossible values with NA and impute more reasonable values (for total_area and total_livable_area); (2) replace zeros with ones for numbers of rooms, bathrooms, and bedrooms; and (3) replace zeros for year_built with the mean value of year_built.
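These three replacement rules can be sketched compactly on a toy data frame (column names mirror the report; the values are invented, with one bad row and one good row):

```r
# hypothetical mini data set: row 1 has the impossible values, row 2 is clean
df <- data.frame(total_area         = c(1, 1500),
                 number_of_bedrooms = c(0, 3),
                 year_built         = c(0, 1925))

# (1) impossible areas become NA, to be imputed later (we used missForest)
df$total_area <- ifelse(df$total_area < 2, NA, df$total_area)
# (2) zero room counts become one
df$number_of_bedrooms[df$number_of_bedrooms == 0] <- 1
# (3) zero build years take the mean of the remaining valid years
df$year_built[df$year_built == 0] <- mean(df$year_built[df$year_built != 0])
```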

For categorical variables such as building type price class, quality grade, and neighborhood price class, we created dummy variables. Dummy variables break each category of a categorical variable into its own feature, typically a column of ones and zeros indicating whether or not a row belongs to that category.
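In our pipeline we use `fastDummies::dummy_cols()` for this; base R’s `model.matrix()` illustrates the same idea on a toy column:

```r
# One-hot (dummy) encoding of a toy categorical column.
df <- data.frame(quality_grade = c("High", "Mid", "Lowest", "Mid"))

# "- 1" drops the intercept so every category gets its own 0/1 column
dummies <- model.matrix(~ quality_grade - 1, df)
colnames(dummies)  # "quality_gradeHigh" "quality_gradeLowest" "quality_gradeMid"
```

Each row has exactly one 1, marking its category.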

Checking for Collinearity

When using numerous predictors in a model, it is important to check for collinearity, i.e. how strongly correlated the predictors are with one another. We explored these relationships by examining a correlation matrix. As shown in Figure 1, some variables, such as the numbers of rooms, bedrooms, and bathrooms, are strongly collinear. Taking note of these patterns, we nevertheless kept our full subset of predictors, since our next step accounted for collinearity.
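A toy example of the kind of collinearity we found among the room-count variables (synthetic data, not the actual housing records):

```r
set.seed(42)
bedrooms <- rpois(100, 3)
rooms    <- bedrooms + rpois(100, 2)  # total rooms include bedrooms, so the two move together
area     <- runif(100, 800, 3000)     # generated independently of either

round(cor(cbind(bedrooms, rooms, area)), 2)  # bedrooms/rooms strongly correlated
```

Strongly correlated predictors carry largely redundant information; a stepwise selection step will tend to retain only one of each such pair.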

Show the code
phl_path <- "https://opendata.arcgis.com/datasets/405ec3da942d4e20869d4e1449a2be48_0.geojson"
phl <- st_read(phl_path, quiet = TRUE) %>%
          st_transform(crs = crs)

### base data-----------------------------------------------
data_path <- ("data/2023/studentData.geojson")
data <- st_read(data_path, quiet = TRUE)

drop <- c("objectid", "assessment_date", "beginning_point", "book_and_page", "category_code", 
          "cross_reference", "date_exterior_condition", "house_number", "location", "owner_1", 
          "owner_2", "parcel_number", "recording_date", "registry_number", "sale_date",
          "mailing_address_1", "mailing_address_2", "mailing_care_of", "mailing_zip", "mailing_street", 
          "mailing_city_state", "building_code", "geographic_ward", "state_code", "street_code", 
          "street_name", "street_designation", "street_direction", "census_tract", "suffix",
          "zip_code", "building_code_new", "year_built_estimate", "pin", "unit", "exempt_land",
          "building_code_description"
          )

to_cat <- c("category_code_description", "garage_type")

data <- data %>%
          mutate(non_resident_owner = mailing_address_1 == mailing_street) %>%
          select(-all_of(drop)) %>%
          mutate_at(to_cat, as.character) %>% 
          # mutate(house_extension = as.numeric(house_extension)) %>% # for some reason, this substantially degrades the model
          st_transform(crs = crs) %>%
          mutate(number_of_rooms = ifelse(is.na(number_of_rooms), number_of_bedrooms + number_of_bathrooms, number_of_rooms))


data <- data[phl, ]
keep_columns <- sapply(data, function(col) length(unique(col)) > 1) #drop columns with only one unique value (i.e., no variance)
data <- data[, keep_columns]

# as.integer(sqrt(length(unique(data$building_code_description_new))))
# 
# data %>%
#   group_by(building_code_description_new) %>%
#   summarize(avg_sale_price = mean(sale_price)) %>%
#   ggplot() +
#   geom_col(aes(x = reorder(building_code_description_new, avg_sale_price), y = avg_sale_price)) +
#   theme(axis.text.x = element_text(hjust = 1, angle = 45))

# over 800,000
# over 600,000
# over 400,000
# over 200,000
# under 200,000

avg_sale_prices_x_building_type <- data %>%
  group_by(building_code_description_new) %>%
  summarize(avg_sale_price = mean(sale_price)) %>%
  mutate(building_type_price_class = case_when(
    avg_sale_price > 800000 ~ "most expensive",
    avg_sale_price <= 800000 & avg_sale_price > 600000 ~ "more expensive",
    avg_sale_price <= 600000 & avg_sale_price > 400000 ~ "expensive",
    avg_sale_price <= 400000 & avg_sale_price > 200000 ~ "less expensive",
    avg_sale_price <= 200000 ~ "least expensive",
  )) %>%
  select(building_code_description_new, building_type_price_class) %>%
  st_drop_geometry()

data <- left_join(data, avg_sale_prices_x_building_type, by = "building_code_description_new")


# data %>%
#   group_by(quality_grade) %>%
#   summarize(avg_price = mean(sale_price)) %>%
#   arrange(desc(avg_price)) %>%
#   ggplot() +
#   geom_col(aes(x = reorder(quality_grade, avg_price),
#                y = avg_price))

data <- data %>%
          mutate(quality_grade = case_when(
            quality_grade == "X" ~ "Highest",
            quality_grade %in% c("X-", "A+", "A", "A-") ~ "High",
            quality_grade %in% c("C", "C-", "D+", "D", "D-", "6", "E+", "E", "E-") ~ "Lowest",
            TRUE ~ "Mid"
          ))


### neighborhoods-----------------------------------------------
hoods_path <- 'https://raw.githubusercontent.com/azavea/geo-data/master/Neighborhoods_Philadelphia/Neighborhoods_Philadelphia.geojson'
hoods <- st_read(hoods_path, quiet = T) %>%
  st_transform(crs = crs) %>%
  select(mapname)

data <- st_join(data, hoods)

# 
# as.integer(sqrt(length(unique(data$mapname))))
# 
# data %>%
#   group_by(mapname) %>%
#   summarize(avg_sale_price = mean(sale_price)) %>%
#   arrange(desc(avg_sale_price)) %>%
#   ggplot() +
#   geom_histogram(aes(x = avg_sale_price), bins = 12) +
#   scale_x_continuous(breaks = c(50000, 100000, 250000, 500000, 1000000))

avg_sale_prices_x_hood <- data %>%
  group_by(mapname) %>%
  summarize(avg_sale_price = mean(sale_price)) %>%
  mutate(hood_price_class = case_when(
    avg_sale_price > 750000 ~ "most expensive",
    avg_sale_price <= 750000 & avg_sale_price > 500000 ~ "more expensive",
    avg_sale_price <= 500000 & avg_sale_price > 250000 ~ "expensive",
    avg_sale_price <= 250000 & avg_sale_price > 100000 ~ "less expensive",
    avg_sale_price <= 100000 ~ "least expensive",
  )) %>%
  select(mapname, hood_price_class) %>%
  st_drop_geometry()

data <- left_join(data, avg_sale_prices_x_hood, by = "mapname")


### ACS-----------------------------------------------
phl_acs <- readRDS("phl_acs.RData")
phl_acs <- phl_acs %>%
              select(-c(
                medRent,
                pctPov
              ))
data <- st_join(data, phl_acs) %>% 
  mutate(pctWhite = ifelse(pctWhite > 1, 1, pctWhite)) # there were some values > 1 here. not sure what's happening 

### tree canopy cover-----------------------------------------------
# url <- "https://national-tes-data-share.s3.amazonaws.com/national_tes_share/pa.zip.zip"
# tmp_file <- tempfile(fileext = ".zip")
# curl_download(url, tmp_file)
# 
# unzipped_folder_1 <- tempfile()
# unzip(tmp_file, exdir = unzipped_folder_1)
# shp_files <- list.files(unzipped_folder_1, pattern = "\\.shp$", recursive = TRUE, full.names = TRUE)
# tree_canopy_gap <- st_read(shp_files[1], quiet = TRUE)  # assuming there's only one .shp file
# phl_tree_canopy <- tree_canopy_gap %>%
#   filter(state == "PA",
#          county == "Philadelphia County") %>%
#   transmute(tree_cover = 1 - tc_gap) %>%
#   st_transform(crs = crs)
# 
# data <- st_join(data, phl_tree_canopy)

### spatial lag of sale price-----------------------------------------------
# data <- data %>% 
#           mutate(nb = st_knn(geometry, k = 15),
#                  wt = st_weights(nb),
#                  price_lag = st_lag(sale_price, nb, wt, na_ok = T)) %>%
#           select(-c(nb, wt))

data <- data %>%
          mutate(sale_price_NA = ifelse(sale_price == 0, NA, sale_price),
                 nb = st_knn(geometry, k = 15),
                 wt = st_weights(nb),
                 price_lag = st_lag(sale_price_NA, nb, wt, na_ok = T)) %>%
          select(-c(nb, wt, sale_price_NA))

### proximity to commercial corridors-----------------------------------------------
corridors_path <- "https://opendata.arcgis.com/datasets/f43e5f92d34e41249e7a11f269792d11_0.geojson"
corridors <- st_read(corridors_path, quiet= TRUE) %>% st_transform(crs = crs)

nearest_fts <- sf::st_nearest_feature(data, corridors)

# convert to rsgeo geometries
x <- rsgeo::as_rsgeo(data)
y <- rsgeo::as_rsgeo(corridors)

# calculate distance
data$dist_to_commerce <- rsgeo::distance_euclidean_pairwise(x, y[nearest_fts])

### proximity to downtown-----------------------------------------------
# downtown <- st_sfc(st_point(c(-75.16408, 39.95266)), crs = 4326)
# downtown_sf <- st_sf(geometry = downtown)
# downtown_sf <- downtown_sf %>% st_transform(crs= crs)
# 
# nearest_fts <- sf::st_nearest_feature(data, downtown_sf)
# 
# # convert to rsgeo geometries
# x <- rsgeo::as_rsgeo(data)
# y <- rsgeo::as_rsgeo(downtown_sf)
# 
# # calculate distance
# data$dist_to_downtown <- rsgeo::distance_euclidean_pairwise(x, y[nearest_fts])

### shootings density-----------------------------------------------
# shootings_url <- 'https://phl.carto.com/api/v2/sql?q=SELECT+*+FROM+shootings+WHERE+year+%3E+2022&filename=shootings&format=geojson&skipfields=cartodb_id'
# shootings <- st_read(shootings_url, quiet = TRUE) %>% 
#                   st_transform(crs = crs) %>%
#                   filter(!st_is_empty(geometry))
# 
# sp_points <- as(shootings, "Spatial")
# ppp_object <- as.ppp(sp_points)
# kde <- densityAdaptiveKernel(ppp_object)
# kde_spatraster <- rast(kde)
# 
# guncrime <- terra::extract(kde_spatraster, data) %>%
#               transmute(ID = ID,
#                         guncrime_density = scale(lyr.1))
# 
# data <- cbind(data, guncrime) 
Show the code
# data %>%
#   group_by(mapname) %>%
#   summarize(avg_sale_price = mean(sale_price)) %>%
#   arrange(desc(avg_sale_price)) %>%
#   head()

# m <- cor(numeric_only %>% na.omit())
# corrplot::corrplot(m, na.action = na.exclude)

# numeric_only <- data %>% st_drop_geometry() %>% select(where(is.numeric))
# model <- lm(sale_price ~ ., data = numeric_only)
# ols_step_best_subset(model)
# aic_preds <- ols_step_both_aic(model)$predictors
keep_vars <- c("price_lag",
               "garage_spaces",
               "total_livable_area",
               "medHHInc",
               "number_of_rooms",
               "number_of_bathrooms",        
               "depth",
               "off_street_open",
               "fireplaces",
               "interior_condition",
               "number_stories",
               "pctWhite",
               "number_of_bedrooms",
               "exterior_condition",
               "dist_to_commerce",
               "year_built",
               "totPop",
               "total_area",
               "sale_price", 
               "hood_price_class", 
               "building_type_price_class", 
               "quality_grade",
               "musaID",
               "toPredict") # keeping toPredict to split data after creating dummy columns

data_to_dummy <- data %>%
                  select(all_of(keep_vars)) # for making summary statistics, keep categorical data as is and add imputed values

# count sum of missing values
# na_count <- c()
# for (i in 1:ncol(data_to_dummy)) {
#     row1 <- data_to_dummy[,i]
#     na_count <- c(na_count,sum(is.na(row1)))
# }
# 
# na_count <- data.frame(name = names(data_to_dummy), na_count) %>% 
#   filter(na_count > 0)
# columns with missing data: garage_spaces, medHHInc, number_of_rooms, number_of_bathrooms, depth, off_street_open, fireplaces, interior_condition, number_stories, number_of_bedrooms, exterior_condition, total_area

dummied_data <- data_to_dummy %>% 
                  dummy_cols(select_columns = c("hood_price_class",
                                                "building_type_price_class",
                                                "quality_grade")) %>%
                    clean_names() %>%
                    select(-c(hood_price_class,
                              building_type_price_class,
                              quality_grade))

# impute missing data
# to do this, have to remove to_predict (no characters are allowed + this has no NAs)
# convert binary + integer (number of rooms) data into factors before imputing
# there are some rows with 0 or 1 for total_area and total_livable_area. since this is impossible, we're replacing them with NA and imputing

# dummied_data_toimpute <- dummied_data %>% 
#   st_drop_geometry() %>% 
#   dplyr::select(-c(geometry,to_predict)) %>% 
#   mutate(total_area = ifelse(total_area < 2, NA, total_area),
#          total_livable_area = ifelse(total_livable_area < 2, NA, total_livable_area),
#          garage_spaces = as.factor(garage_spaces),
#          number_of_rooms = as.factor(number_of_rooms),
#          number_of_bathrooms = as.factor(number_of_bathrooms),
#          number_of_bedrooms = as.factor(number_of_bedrooms),
#          fireplaces = as.factor(fireplaces),
#          interior_condition = as.factor(interior_condition),
#          exterior_condition = as.factor(exterior_condition),
#          number_stories = as.factor(number_stories))

# method without turning num vars to factors, round to nearest integer where necessary after imputation
# number of bedrooms, bathrooms, and rooms == 0 need to be replaced with 1
# year_built == 0 needs to be replaced with average year
dummied_data_toimpute2 <- dummied_data %>% 
  st_drop_geometry() %>% 
  dplyr::select(-c(geometry,to_predict)) %>% 
  mutate(total_area = ifelse(total_area < 2, NA, total_area),
         total_livable_area = ifelse(total_livable_area < 2, NA, total_livable_area))

# imp_data <- missForest(dummied_data_toimpute,ntree = 20)$ximp %>%
#               cbind(dummied_data %>% dplyr::select(to_predict))

# when rendering, suppressing this chunk for now for sake of time
# imp_data2 <- missForest(dummied_data_toimpute2,ntree = 20)$ximp %>%
#               cbind(dummied_data %>% dplyr::select(to_predict))

# saving imp_data so I don't have to run it again
# write_csv(imp_data, "imputed_data_wfactors_231010.csv")
# write_csv(imp_data2, "imputed_data_231010.csv")

# imp_data <- read_csv("imputed_data_wfactors.csv")
imp_data2 <- read_csv("imputed_data_231010.csv")

# for imp_data2, round to nearest whole number for garage_spaces, number_of_rooms, number_of_bathrooms, fireplaces, interior_condition,number_stories,number_of_bedrooms,exterior_condition
imp_data <- imp_data2 %>% 
  mutate(garage_spaces = round(garage_spaces),
         number_of_rooms = round(number_of_rooms),
         number_of_bathrooms = round(number_of_bathrooms),
         fireplaces = round(fireplaces),
         interior_condition = round(interior_condition),
         number_stories = round(number_stories),
         number_of_bedrooms = round(number_of_bedrooms),
         exterior_condition = round(exterior_condition))

reg_data <- imp_data %>%
              filter(to_predict == "MODELLING") %>% # only keep modelling data
              select(-c(to_predict))

# geom and musaID to add back later
reg_data_geom <- data_to_dummy %>% 
  filter(toPredict == "MODELLING") %>%
  dplyr::select(musaID,geometry)
Show the code
# reorder numeric data
sum_stat_data <- data_to_dummy %>% 
  st_drop_geometry() %>% 
  dplyr::select(-c(garage_spaces, medHHInc, number_of_rooms, number_of_bathrooms, depth, off_street_open, fireplaces, musaID,
                   interior_condition, number_stories, number_of_bedrooms, exterior_condition, total_area, sale_price, 
                   price_lag, total_livable_area)) %>% 
  cbind(imp_data %>% dplyr::select(garage_spaces, med_hh_inc, number_of_rooms, number_of_bathrooms, depth,price_lag, 
                                   off_street_open, fireplaces,interior_condition, number_stories,number_of_bedrooms,
                                   exterior_condition, total_area,total_livable_area)) %>% 
  dplyr::select( # select in the desired order
    total_area, total_livable_area, number_of_rooms, number_of_bathrooms, number_of_bedrooms, number_stories, garage_spaces, 
    fireplaces, depth, interior_condition, exterior_condition, year_built, quality_grade, building_type_price_class,
    off_street_open, # internal characteristics
    dist_to_commerce, # amenities/public services
    price_lag, med_hh_inc, pctWhite, totPop, hood_price_class  # spatial variables
  )

# summary statistics for numeric variables
sum_stat_table <- sum_stat_data %>% 
  dplyr::select(-c(building_type_price_class,quality_grade,hood_price_class)) %>% 
  psych::describe() %>% dplyr::select(mean:median,min,max) %>% 
  mutate(Mean = round(mean), 
         SD = round(sd), 
         Median = round(median), 
         Min = min, 
         Max = round(max)) %>% 
  dplyr::select(-(mean:max)) %>% 
  as.data.frame()

sum_stat_table$Description <- 
  c("Total area in square feet",
    "Total liveable area in square feet",
    "Number of rooms",
    "Number of bathrooms",
    "Number of bedrooms",
    "Number of stories",
    "Number of garage spaces",
    "Number of fireplaces",
    "Depth, as measured from the principal street back to the rear property line or secondary street",
    "Overall condition of the interior",
    "Overall condition of the exterior",
    "Year the house was built",
    "",
    "Distance to nearest commercial center",
    "Spatial lag of sale price, 15 nearest neighbors",
    "Median household income, census tract level data",
    "Percent of population that is white, census tract level data",
    "Total population, census tract level data")

sum_stat_table %>% 
  kbl(caption = "Table 1: Summary Statistics of numeric variables", align = rep("c", 8)) %>%
  kable_classic(full_width = F, html_font = "Cambria")
Table 1: Summary Statistics of numeric variables
Mean SD Median Min Max Description
total_area 1932 3795 1244 150 226512 Total area in square feet
total_livable_area 1336 547 1210 64 15246 Total liveable area in square feet
number_of_rooms 4 2 4 0 32 Number of rooms
number_of_bathrooms 1 1 1 0 8 Number of bathrooms
number_of_bedrooms 3 1 3 0 31 Number of bedrooms
number_stories 2 1 2 1 5 Number of stories
garage_spaces 0 1 0 0 72 Number of garage spaces
fireplaces 0 0 0 0 5 Number of fireplaces
depth 76 36 73 0 1115 Depth, as measured from the principal street back to the rear property line or secondary street
interior_condition 4 1 4 0 7 Overall condition of the interior
exterior_condition 4 1 4 1 7 Overall condition of the exterior
year_built 1937 51 1925 0 2024 Year the house was built
off_street_open 2005 1705 1535 14 12685
dist_to_commerce 670 638 546 0 7384 Distance to nearest commercial center
price_lag 268941 174444 226580 35440 1913110 Spatial lag of sale price, 15 nearest neighbors
med_hh_inc 54554 25385 50227 9276 155000 Median household income, census tract level data
pctWhite 0 0 0 0 1 Percent of population that is white, census tract level data
totPop 4691 1690 4473 0 9146 Total population, census tract level data
Show the code
# descriptive statistics for categorical data
sum_stat_data %>% 
  dplyr::select(building_type_price_class) %>% 
  mutate(building_type_price_class = 
           factor(building_type_price_class, 
           levels = c("least expensive","less expensive","expensive","more expensive","most expensive"))) %>% 
  table() %>% as.data.frame() %>% 
  mutate(Description = c("Building types with average sale price of 200,000 USD or less",
                         "Building types with average sale price over 200,000 and up to 400,000 USD",
                         "Building types with average sale price over 400,000 and up to 600,000 USD",
                         "Building types with average sale price over 600,000 and up to 800,000 USD",
                         "Building types with average sale price over 800,000 USD")) %>% 
  kbl(caption = "Table 2: Descriptive statistics of sales price classes of building type", 
      align = rep("c", 8), col.names = c("Building Type", "Row count", "Description")) %>%
  kable_classic(full_width = F, html_font = "Cambria")
Table 2: Descriptive statistics of sales price classes of building type
Building Type Row count Description
least expensive 6679 Building types with average sale price of 200,000 USD or less
less expensive 15484 Building types with average sale price over 200,000 and up to 400,000 USD
expensive 1156 Building types with average sale price over 400,000 and up to 600,000 USD
more expensive 552 Building types with average sale price over 600,000 and up to 800,000 USD
most expensive 10 Building types with average sale price over 800,000 USD
Show the code
sum_stat_data %>% 
  dplyr::select(hood_price_class) %>% 
  mutate(hood_price_class = 
           factor(hood_price_class, 
           levels = c("least expensive","less expensive","expensive","more expensive","most expensive"))) %>% 
  table() %>% as.data.frame() %>% 
  mutate(Description = c("Houses in a neighborhood with average sale price of 100,000 USD or less",
                         "Houses in a neighborhood with average sale price over 100,000 and up to 250,000 USD",
                         "Houses in a neighborhood with average sale price over 250,000 and up to 500,000 USD",
                         "Houses in a neighborhood with average sale price over 500,000 and up to 750,000 USD",
                         "Houses in a neighborhood with average sale price over 750,000 USD")) %>% 
  kbl(caption = "Table 3: Descriptive statistics of sales price classes of neighborhoods", 
      align = rep("c", 8), col.names = c("Neighborhood", "Row count", "Description")) %>%
  kable_classic(full_width = F, html_font = "Cambria")
Table 3: Descriptive statistics of sales price classes of neighborhoods
Neighborhood Row count Description
least expensive 1763 Houses in a neighborhood with average sale price of 100,000 USD or less
less expensive 11746 Houses in a neighborhood with average sale price over 100,000 and up to 250,000 USD
expensive 8301 Houses in a neighborhood with average sale price over 250,000 and up to 500,000 USD
more expensive 1445 Houses in a neighborhood with average sale price over 500,000 and up to 750,000 USD
most expensive 626 Houses in a neighborhood with average sale price over 750,000 USD
Show the code
sum_stat_data %>% 
  dplyr::select(quality_grade) %>% 
  mutate(quality_grade = 
           factor(quality_grade, 
           levels = c("Lowest","Mid","High"))) %>% 
  table() %>% as.data.frame() %>% 
  mutate(Description = c("Houses with a quality grade of \"C\", \"C-\", \"D+\", \"D\", \"D-\", \"E+\", \"E\", or \"E-\"",
                         "Houses with any other quality grade",
                         "Houses with a quality grade of \"X-\", \"A+\", \"A\", or \"A-\"")) %>% 
  kbl(caption = "Table 4: Descriptive statistics of quality grade classes", 
      align = rep("c", 8), col.names = c("Quality Grade", "Row count", "Description")) %>%
  kable_classic(full_width = F, html_font = "Cambria")
Table 4: Descriptive statistics of quality grade classes
Quality Grade Row count Description
Lowest 776 Houses with a quality grade of "C", "C-", "D+", "D", "D-", "E+", "E", or "E-"
Mid 23045 Houses with any other quality grade
High 60 Houses with a quality grade of "X-", "A+", "A", or "A-"
Show the code
numeric_vars <- sum_stat_data %>% 
  dplyr::select(-c(building_type_price_class,quality_grade,hood_price_class)) %>% 
  mutate_if(is.factor,as.numeric) %>% 
  cbind(data_to_dummy %>% st_drop_geometry() %>% dplyr::select(sale_price))
  
ggcorrplot(
  round(cor(numeric_vars), 1), 
  p.mat = cor_pmat(numeric_vars),
  ggtheme = ggplot2::theme_gray,
  colors = c("#025e44", "white", "#6e2e10"),
  type="lower",
  insig = "blank") +  
  labs(title = "Correlation across numeric variables",
       caption = "Figure 1") 

Show the code
st_drop_geometry(reg_data) %>%
  dplyr::select(sale_price,dist_to_commerce,price_lag,total_livable_area,med_hh_inc) %>%
  rename(`A. Distance to Commercial Center (ft)` = dist_to_commerce,
         `B. Total Livable Area (sq ft)` = total_livable_area,
         `C. Median Household Income ($)` = med_hh_inc,
         `D. Spatial Lag of Home Sale Price` = price_lag) %>%
  gather(Variable, Value, -sale_price) %>%
  ggplot(aes(Value, sale_price)) +
  ggplot2::geom_point(size = .5, color = "#6e2e10", alpha = 0.4) + 
  geom_smooth(method = "lm", se=F, colour = "#027a59") +
  facet_wrap(~Variable, ncol = 2, scales = "free") +
  labs(title = "Price as a function of 4 variables of interest",
       y = "Home Sale Price", x = "",
       caption = "Figure 2") +
  theme_minimal()

Show the code
# define color palettes
mypalette1 <- colorRampPalette(c("#bbf0e1","#025e44"))(5)
mypalette2 <- colorRampPalette(c("#f0cab9","#6e2e10"))(5)

# add geometry back in
imp_data_geom <- cbind(imp_data, data_to_dummy %>% 
                         dplyr::select(geometry)) %>% st_as_sf() %>% 
  mutate(num_rooms = as.numeric(number_of_rooms))

# map of dist_to_commerce
ggplot() +
  geom_sf(data = hoods, color = "white", fill = "#f0cab9") + 
  geom_sf(data = imp_data_geom, aes(colour = q5(dist_to_commerce)), 
          show.legend = "point", size = .25) +
  scale_colour_manual(values = mypalette1,
                      labels=qBr(imp_data_geom,"dist_to_commerce"),
                      name="Distance to Commercial\nCenter (ft)") +
  labs(title = "Distance to nearest Commercial Center",
       subtitle="Philadelphia, PA", # make sure to note that NAs were dropped for visualization purpose
       caption = "Figure 3") +
  mapTheme()

Show the code
buildtype_data <- data_to_dummy %>% 
  mutate(buildtype = factor(building_type_price_class, 
           levels = c("least expensive","less expensive","expensive","more expensive","most expensive")))

# map of build type
ggplot() +
  geom_sf(data = hoods, color = "white", fill = "#f0cab9") + 
  geom_sf(data = buildtype_data, aes(colour = buildtype), 
          show.legend = "point", size = .25) +
  scale_colour_manual(values = mypalette1,
                      name = "Price Class of\nHouse Type") +
  labs(title = "Price Class of House Type",
       subtitle="Philadelphia, PA", # NAs dropped for visualization
       caption = "Figure 4") +
  mapTheme()

Show the code
# map of exterior_condition
ggplot() +
  geom_sf(data = hoods, color = "white", fill = "#f0cab9") + 
  geom_sf(data = imp_data_geom, aes(colour = q5(num_rooms)), 
          show.legend = "point", size = .25) +
  scale_colour_manual(values = mypalette1,
                      labels=qBr(imp_data_geom,"num_rooms"),
                      name="Number of Rooms") +
  labs(title = "Number of Rooms",
       subtitle="Philadelphia, PA", # NAs dropped for visualization
       caption = "Figure 5") +
  mapTheme()

Map of Home Sale Prices in Philadelphia

Show the code
# create price per square foot variable
ppsqft <- reg_data %>% dplyr::select(sale_price,total_area) %>% 
      mutate(ppsqft_m = ifelse(total_area != 0 & total_area != 1,sale_price/total_area, NA)) %>%  # there are some places where total_area == 1
      cbind(data_to_dummy %>% filter(toPredict == "MODELLING") %>% dplyr::select(geometry)) %>% 
  st_as_sf() %>% 
      filter(!is.na(ppsqft_m)) # dropping missing ppsqft_m values for a cleaner map

ggplot() +
  geom_sf(data = hoods, color = "white") + 
  geom_sf(data = ppsqft, aes(colour = q5(ppsqft_m)), 
          show.legend = "point", size = .25) +
  scale_colour_manual(values = mypalette2,
                      labels=qBr(ppsqft,"ppsqft_m"),
                      name=bquote((USD/ft^2))) +
  labs(title = "Home Sale Price in Philadelphia",
       subtitle="Price Per Square Foot", # NAs dropped for visualization
       caption = "Figure 6") +
  mapTheme()

OLS Model

We used an ordinary least squares (OLS) regression model as our predictive model. The components of an OLS model are the dependent variable, the independent variable(s), a coefficient for each independent variable, an intercept, and an error term. In our case, the dependent variable (\(Y_i\)) is the sale price of the ith house, i.e., what we are trying to predict, and the predictors we put into the model are the independent variables (\(X_{1_i},X_{2_i},...,X_{n_i}\), where there are n predictors). The model estimates a coefficient (\(\beta_1,\beta_2,...,\beta_n\)) for each independent variable as well as an intercept (\(\beta_0\)). Each coefficient is the expected change in sale price for every one-unit change in the independent variable it is paired with, and the intercept is the expected sale price when every predictor equals zero. The error term (\(\epsilon_i\)) is the difference between the actual home sale price and the price predicted by the model, and it varies for every observation.

\[ Y_i = \beta_0 + \beta_1 X_{1_i} + \beta_2 X_{2_i} + ... + \beta_n X_{n_i} + \epsilon_i \]
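As a concrete illustration, the hypothetical example below (toy numbers, not our Philadelphia data) shows how `lm()` estimates these coefficients from price and living-area pairs:

```r
# Hypothetical example: OLS with a single predictor.
# Because these four points lie exactly on a line, lm() recovers
# a slope of ~$166.67 per additional square foot and an intercept near 0.
toy <- data.frame(sale_price = c(150000, 200000, 250000, 300000),
                  total_livable_area = c(900, 1200, 1500, 1800))
toy_fit <- lm(sale_price ~ total_livable_area, data = toy)
coef(toy_fit)  # beta_0 (intercept) and beta_1 ($ change per sq ft)
```

Our actual model works the same way, only with many predictors, so each coefficient is interpreted holding all other predictors constant.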

OLS also relies on several assumptions: that the relationship between each predictor and sale price is linear, that errors are independent and normally distributed with constant variance, and that no predictor is a perfect linear combination of the others. Because home prices tend to be spatially autocorrelated, the independence assumption deserves particular scrutiny, and we test for it below using Moran's I.

Show the code
# processed_dd <- preProcess(dummied_data %>% select(-sale_price),
#                            method = c("center", "scale", "YeoJohnson", "nzv"))
# 
# processed_dd
# predict(processed_dd, newdata = dummied_data %>% select(-sale_price))
# 
# processed_dd$method$remove

# 
# model <- lm(sale_price ~ ., data = dummied_data)
# stepwise_selection <- ols_step_both_aic(model)
# keep_vars <- c(stepwise_selection$predictors, "sale_price")
# 
# final_data <- dummied_data %>% select(keep_vars) %>% st_drop_geometry()


# book's method of doing this
inTrain <- createDataPartition(
              y = paste(reg_data$fireplaces, reg_data$number_of_rooms,
                        reg_data$exterior_condition), 
              p = .60, list = FALSE) # p = .60 = split to 60/40 ratio for train/test
phl.training <- reg_data[inTrain,] # defining training set
phl.test <- reg_data[-inTrain,]    # defining testing set

reg.training <- 
  lm(sale_price ~ ., data = as.data.frame(phl.training) %>% 
                             dplyr::select(-musa_id))

# plot(reg.training) # useful plots to look at model fit

phl.test <-
  phl.test %>%
  mutate(sale_price.Predict = predict(reg.training, phl.test), # add predicted values from the model in reg.training
         sale_price.Error = sale_price.Predict - sale_price,
         sale_price.AbsError = abs(sale_price.Predict - sale_price),
         sale_price.APE = abs(sale_price.Predict - sale_price) / sale_price) # APE relative to observed sale price

mape1 <- mean(phl.test$sale_price.APE)
mae1 <- mean(phl.test$sale_price.AbsError)

summ_reg.training <- summary(reg.training)
summ_coeff1 <- as.data.frame(summary(reg.training)$coefficients) # same as summ_coeff

Results

Regression Results

Show the code
footnote1 <- paste0("R-squared: ",round(summ_reg.training$r.squared,3),
                    "; adjusted R-squared: ",round(summ_reg.training$adj.r.squared,3),
                    "; F-statistic: ", round(summ_reg.training$fstatistic[1],3), 
                    " (df=", round(summ_reg.training$fstatistic[2],3), 
                    "; ", round(summ_reg.training$fstatistic[3],3), 
                    "); N = ", length(summ_reg.training$residuals),
                    "; Residual Std. Err.:", round(summ_reg.training$sigma,3), 
                    " (df=",summ_reg.training$df[2],")")

footnote2 <- "* p < 0.1; ** p < 0.05; *** p < 0.01"

summ_coeff1 %>% 
  mutate(`Sale Price` = round(Estimate,3),
         SE = round(`Std. Error`,3),
         `Sale Price` = case_when(
           `Pr(>|t|)` >= 0.1 ~ as.character(`Sale Price`),
           `Pr(>|t|)` < 0.1 & `Pr(>|t|)` >= 0.05 ~ paste0(`Sale Price`,"*"),
           `Pr(>|t|)` < 0.05 & `Pr(>|t|)` >= 0.01 ~ paste0(`Sale Price`,"**"),
           `Pr(>|t|)` < 0.01 ~ paste0(`Sale Price`,"***")
         )) %>% 
  dplyr::select(`Sale Price`,SE) %>% 
  kbl(caption = "Table 5: Summary of Training set LM", align = rep("c", 8)) %>%
  footnote(c(footnote1,footnote2)) %>% 
  kable_classic(full_width = F, html_font = "Cambria")
Table 5: Summary of Training set LM
Sale Price SE
(Intercept) 67333.941 66915.935
price_lag 0.598*** 0.014
garage_spaces -9095.229*** 1516.028
total_livable_area 170.248*** 2.845
med_hh_inc -0.071 0.082
number_of_rooms -2961.431 3364.475
number_of_bathrooms 36838.944*** 4113.851
depth -128.532*** 40.307
off_street_open 1.14 0.802
fireplaces 33435.466*** 4089.994
interior_condition -21730.892*** 1972.662
number_stories -7976.931*** 2290.971
pct_white 59191.704*** 5758.684
number_of_bedrooms -3957.291 3570.180
exterior_condition -11802.601*** 1963.261
dist_to_commerce -4.672** 2.044
year_built 22.51 24.253
tot_pop 3.215*** 0.705
total_area 0.35 0.363
hood_price_class_expensive -160532.338*** 9378.483
hood_price_class_least_expensive -171065.9*** 12256.641
hood_price_class_less_expensive -160434.922*** 10795.797
hood_price_class_more_expensive -130305.41*** 8823.786
building_type_price_class_expensive 43729.389 45255.981
building_type_price_class_least_expensive 21307.15 44978.286
building_type_price_class_less_expensive 23819.064 44926.626
building_type_price_class_more_expensive 24040.349 45143.741
quality_grade_high -16771.03 21688.274
quality_grade_lowest 13886.838** 6296.308
Note:
R-squared: 0.728; adjusted R-squared: 0.728; F-statistic: 1372.682 (df=28; 14326); N = 14355; Residual Std. Err.:131462.47 (df=14326)
* p < 0.1; ** p < 0.05; *** p < 0.01
Show the code
data.frame(MAE = sprintf('%.2f', mae1),
           MAPE = round(mape1,3)) %>% 
  kbl(caption = "Table 6: Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) of one test set",
      align = c("c","c")) %>% 
  kable_classic(full_width = F, html_font = "Cambria") %>% 
  column_spec(1, width = "10em") %>% 
  column_spec(2, width = "10em")
Table 6: Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) of one test set
MAE MAPE
67954.52 0.324
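For reference, with \(Y_i\) the observed sale price and \(\hat{Y}_i\) the predicted sale price for the ith of \(n\) test-set sales, the two error metrics reported above are defined as:

\[ MAE = \frac{1}{n}\sum_{i=1}^{n}\left|Y_i - \hat{Y}_i\right| \qquad MAPE = \frac{1}{n}\sum_{i=1}^{n}\frac{\left|Y_i - \hat{Y}_i\right|}{Y_i} \]

MAE is expressed in dollars and is easy to interpret but is dominated by errors on expensive homes; MAPE is unitless, which makes it better suited for comparing error across neighborhoods with very different price levels.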
Show the code
customSummary <- function(data, lev = NULL, model = NULL) {
  mape <- mean((abs(data$obs - data$pred)) / data$obs) * 100
  mae <- mean(abs(data$obs - data$pred))
  rmse <- sqrt(mean((data$obs - data$pred)^2))
  rsq <- cor(data$obs, data$pred)^2
  out <- c(MAE = mae, RMSE = rmse, Rsquared = rsq, MAPE = mape)
  out
}
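As a quick sanity check, `customSummary()` can be applied to a hypothetical data frame of observed and predicted values (toy numbers for illustration):

```r
# Hypothetical observed/predicted pairs
toy_results <- data.frame(obs = c(100, 200), pred = c(110, 180))
customSummary(toy_results)
# MAE = 15 (mean of 10 and 20); RMSE = sqrt(250), about 15.81;
# Rsquared = 1 (only two points); MAPE = 10 (mean of 10% and 10%)
```

Note that this function reports MAPE as a percentage (multiplied by 100), whereas Table 6 reports it as a proportion.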

train_control <- trainControl(method = "cv", # 100-fold cross-validation for a more stable estimate of error
                              number = 100,
                              summaryFunction = customSummary)

model <- train(sale_price ~ .,
               data = reg_data %>% dplyr::select(-musa_id),
               trControl = train_control,
               method = "lm",
               na.action = na.exclude)

print_model <- as.data.frame(print(model))
Linear Regression 

23781 samples
   31 predictor

No pre-processing
Resampling: Cross-Validated (100 fold) 
Summary of sample sizes: 23543, 23545, 23543, 23543, 23543, 23543, ... 
Resampling results:

  MAE       RMSE      Rsquared   MAPE    
  68687.82  121375.8  0.7476721  36.74076

Tuning parameter 'intercept' was held constant at a value of TRUE
Show the code
model_resample <- model$resample
# summary(model)

# coefficient data from summary of model
# summ_coeff <- as.data.frame(summary(model)$coefficients)
# 
# model_for_plot <- lm(sale_price ~ ., 
#                data = reg_data)
# ols_plot_resid_fit(model_for_plot)

data.frame(MeanMAE = print_model$MAE,
           SDMAE = round(sd(model_resample$MAE),3)) %>% 
  kbl(caption = "Table 7: Results of 100-fold Cross-Validation",
      align = c("c","c"), 
      col.names = c("Mean MAE", "SD of MAE")) %>% 
  kable_classic(full_width = F, html_font = "Cambria") %>% 
  column_spec(1, width = "10em") %>% 
  column_spec(2, width = "10em")
Table 7: Results of 100-fold Cross-Validation
Mean MAE SD of MAE
68687.82 6447.468
Show the code
ggplot(data = model$resample) +
  ggplot2::geom_histogram(aes(x = MAE), fill = "#027a59") +
  labs(title = "Distribution of MAE",
       subtitle = "K-Fold Cross Validation; k = 100",
       x = "Mean Absolute Error",
       y = "Count",
       caption = "Figure 7") +
  plotTheme()

Show the code
phl.test %>%
  dplyr::select(sale_price.Predict, sale_price) %>%
    ggplot(aes(sale_price, sale_price.Predict)) +
  ggplot2::geom_point(color = "#6e2e10", alpha = 0.4) +
  stat_smooth(aes(sale_price, sale_price), 
             method = "lm", se = FALSE, size = 1, colour=mypalette1[2]) + 
  stat_smooth(aes(sale_price.Predict, sale_price), 
              method = "lm", se = FALSE, size = 1, colour="#027a59") +
  labs(title="Predicted sale price as a function of observed price",
       subtitle="Light teal line represents a perfect prediction; Teal line represents prediction",
       caption = "Figure 8",
       x = "Observed Sale Price", y = "Predicted Sale Price") +
  theme_minimal()

Show the code
resids_data <- cbind(reg.training$model,
                  reg.training$residuals,
                  reg.training$fitted.values,
                  phl.training$musa_id) %>%
                rename(musa_id = `phl.training$musa_id`) %>% 
                left_join(dummied_data %>% dplyr::select(musa_id, geometry), by = "musa_id") %>% 
              mutate(resids = `reg.training$residuals`,
                     pred_value = `reg.training$fitted.values`) %>%
            st_sf()

ggplot() +
  geom_sf(data = hoods, color = "white", fill = "#f0cab9") + 
  geom_sf(data = resids_data, aes(colour = q5(resids)), 
          show.legend = "point", size = .25) +
  scale_colour_manual(values = mypalette1,
                      labels=qBr(resids_data,"resids"),
                      name="Residuals") +
  labs(title = "Map of Residuals for Training Set",
       subtitle="Philadelphia, PA", # NAs dropped for visualization
       caption = "Figure 9") +
  mapTheme()

Show the code
# adapted from code chunk: autocorr
data_nb <- phl.test %>% 
            left_join(reg_data_geom, by = c("musa_id" = "musaID")) %>% 
            st_sf() %>% 
            mutate(nb = st_knn(geometry, k = 5),
                   wt = st_weights(nb),
                   .before = 1)


# calculate global Moran's I for sale_price
glbl_moran <- global_moran(data_nb$sale_price, data_nb$nb, data_nb$wt)$`I`

# moran monte carlo
moranMC = global_moran_perm(data_nb$sale_price, data_nb$nb, data_nb$wt, alternative = "two.sided", 999)
moranMC

    Monte-Carlo simulation of Moran I

data:  x 
weights: listw  
number of simulations + 1: 1000 

statistic = 0.55642, observed rank = 1000, p-value < 2.2e-16
alternative hypothesis: two.sided
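The statistic reported above is the global Moran's I, which summarizes the degree of spatial autocorrelation in sale prices. With spatial weights \(w_{ij}\) between observations \(i\) and \(j\) (here, based on each sale's five nearest neighbors), it is defined as:

\[ I = \frac{n}{\sum_{i}\sum_{j} w_{ij}} \cdot \frac{\sum_{i}\sum_{j} w_{ij}(Y_i - \bar{Y})(Y_j - \bar{Y})}{\sum_{i}(Y_i - \bar{Y})^2} \]

Values near 0 indicate spatial randomness, while values approaching 1 indicate strong positive spatial autocorrelation. The observed value of roughly 0.56, which exceeds all 999 permuted values, indicates that home sale prices in Philadelphia are strongly spatially clustered.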
Show the code
moranMCres = moranMC$res |>
  as.data.frame() %>% 
  rename(moranMCres = 1)

ggplot(moranMCres) +
  geom_histogram(aes(moranMCres), bins = 100, fill = "#027a59") +
  geom_vline(xintercept = glbl_moran, col = "#6e2e10") +
  labs(title = "Observed and Permuted Moran's I",
       subtitle = "Brown line indicates the observed Moran's I",
       x = "Moran's I",
       y = "Count",
       caption = "Figure 10") +
  plotTheme()

Show the code
# Moran scatterplot: sale price vs. its spatial lag (Figure 11)
spdep::moran.plot(data_nb$sale_price, nb2listw(data_nb$nb),
           xlab = "Sale Price", 
           ylab = "Spatial Lag")

Show the code
# add caption that says Figure 11

TO DO: Provide a map of your predicted values for where ‘toPredict’ is both “MODELLING” and “CHALLENGE”

Show the code
inTrain2 <- createDataPartition(
              y = paste(imp_data$fireplaces, imp_data$number_of_rooms,
                        imp_data$exterior_condition), 
              p = .60, list = FALSE) # p = .60 = split to 60/40 ratio for train/test
phl.training2 <- imp_data[inTrain2,] # defining training set
phl.test2 <- imp_data[-inTrain2,]    # defining testing set

reg.training2 <- 
  lm(sale_price ~ ., data = as.data.frame(phl.training2) %>% 
                             dplyr::select(-c(musa_id, to_predict)))

# plot(reg.training2) # useful plots to look at model fit

phl.test2 <-
  phl.test2 %>%
  mutate(sale_price.Predict = predict(reg.training2, phl.test2), # add predicted values from the model in reg.training2
         sale_price.Error = sale_price.Predict - sale_price,
         sale_price.AbsError = abs(sale_price.Predict - sale_price),
         sale_price.APE = abs(sale_price.Predict - sale_price) / sale_price) %>% # APE relative to observed sale price
  left_join(dummied_data %>% dplyr::select(musa_id, geometry), by = "musa_id") %>% st_sf()

mape2 <- mean(phl.test2$sale_price.APE)
mae2 <- mean(phl.test2$sale_price.AbsError)

summ_reg.training2 <- summary(reg.training2)
summ_coeff2 <- as.data.frame(summary(reg.training2)$coefficients)

ggplot() +
  geom_sf(data = hoods, color = "white", fill = "#f0cab9") + 
  geom_sf(data = phl.test2, aes(colour = q5(sale_price.Predict)), 
          show.legend = "point", size = .25) +
  scale_colour_manual(values = mypalette1,
                      labels=qBr(phl.test2,"sale_price.Predict"),
                      name="Predicted Sale Price") +
  labs(title = "Map of Predicted Sale Price for Test Set",
       subtitle="Test Set includes both modelling and challenge data; Philadelphia, PA", 
       caption = "Figure 12") +
  mapTheme()

In comparison to the map of observed sale prices (Figure 6), this map of predicted sale prices (including both modelling and challenge data; Figure 12) shows that our model generally predicts prices well, but it overpredicts sale price in the Northeast as well as the northwestern tip of Philadelphia.

TO DO: Using the test set predictions, provide a map of mean absolute percentage error (MAPE) by neighborhood. (The resids code chunk is close to what we want, but use MAPE instead of residuals.)

Show the code
# code adapted from the resids code chunk and the book
# note: proceeding with phl.test2 here, which includes both modelling and challenge rows

mape_data <- phl.test2 %>%
              dplyr::select(sale_price,sale_price.Predict,sale_price.APE,musa_id) %>%
              left_join(dummied_data %>% dplyr::select(musa_id, geometry), by = "musa_id") %>%
              st_sf()

mape_data <- st_join(mape_data, hoods)

mape_x_hood <- st_drop_geometry(mape_data) %>%
  group_by(mapname) %>%
  summarize(mean.MAPE = mean(sale_price.APE, na.rm = T),
            mean.price = mean(sale_price, na.rm = T),
            mean.pred.price = mean(sale_price.Predict, na.rm = T)) 

mape_x_hood %>%
  ungroup() %>% 
  left_join(hoods) %>%
    st_sf() %>%
    ggplot() + 
      geom_sf(aes(fill = mean.MAPE)) +
      # geom_sf(data = bothRegressions, colour = "black", size = .5) +
      scale_fill_gradient(low = mypalette2[1], high = mypalette2[5],
                          name = "MAPE") +
      labs(title = "Mean test set MAPE by neighborhood",
           caption = "Figure 13") +
      mapTheme()

Show the code
# mape_x_hood <- mape_data %>%
#                   group_by(mapname) %>%
#                   summarize(avg_mape = mean(sale_price.APE),
#                             avg_price = mean()) %>%
#                   arrange(desc(avg_mape)) %>%
#                   select(avg_mape, mapname) %>%
#                   st_drop_geometry() %>%
#                   left_join(., hoods, by = "mapname") %>%
#                   st_sf()
# 
# # redefine qBr to show decimal points
# qBr_3 <- function(df, variable, rnd) {
#   if (missing(rnd)) {
#     as.character(quantile(round(df[[variable]],3),
#                           c(.01,.2,.4,.6,.8), na.rm=T))
#   } else if (rnd == FALSE | rnd == F) {
#     as.character(formatC(quantile(df[[variable]],
#                                   c(.01,.2,.4,.6,.8), na.rm=T),
#                          digits = 3))
#   }
# }
# 
# ggplot() +
#   # geom_sf(data = hoods, color = "white", fill = "#f0cab9") + 
#   geom_sf(data = mape_x_hood, aes(fill = q5(avg_mape)), 
#           show.legend = "point", size = .25) +
#   scale_fill_manual(values = mypalette1,
#                     labels=qBr_3(mape_x_hood,"avg_mape"),
#                     name="MAPE") +
#   labs(title = "Map of MAPE by Neighborhood",
#        subtitle="Test Set includes both modelling and challenge data; Philadelphia, PA", 
#        caption = "Figure 13") +
#   mapTheme()

TO DO: Provide a scatterplot of MAPE by neighborhood as a function of mean price by neighborhood. (It is unclear whether this refers to predicted or observed price; observed price is plotted for now.)

Show the code
st_drop_geometry(mape_x_hood) %>%
  ggplot(aes(mean.price, mean.MAPE)) +
  ggplot2::geom_point(size = .5, color = "#6e2e10", alpha = 0.4) + 
  geom_smooth(method = "lm", se=F, colour = "#027a59") +
  labs(title = "Mean MAPE by neighborhood as a function of mean price by neighborhood",
       y = "Mean MAPE by neighborhood", 
       x = "Mean price by neighborhood",
       caption = "Figure 14") +
  plotTheme()

Show the code
# possibly add neighborhood names of extreme outliers

TO DO: Using tidycensus, split your study area into two groups (perhaps by race or income) and test your model’s generalizability. Is your model generalizable?

Show the code
city_mean_inc <- round(mean(phl_acs$medHHInc,na.rm=T),2)

split_data <- phl_acs %>% 
  mutate(incomeContext = ifelse(medHHInc > city_mean_inc, "High Income", "Low Income"),  # i.e., above/below the citywide mean income
         raceContext = ifelse(pctWhite > .5, "Majority White", "Majority Non-White"))

ggplot() + geom_sf(data = na.omit(split_data), aes(fill = incomeContext)) +
  scale_fill_manual(values = c("#027a59", "#6e2e10"), name="Income") +
  labs(title = "Income Context",
       caption = "Figure 15") +
  mapTheme() + theme(legend.position="bottom")

Show the code
ggplot() + geom_sf(data = na.omit(split_data), aes(fill = raceContext)) +
  scale_fill_manual(values = c("#6e2e10", "#027a59"), name="Race") +
  labs(title = "Race Context",
       caption = "Figure 16") +
  mapTheme() + theme(legend.position="bottom")

Show the code
st_join(mape_data, split_data) %>% 
  filter(!is.na(incomeContext)) %>% 
  group_by(incomeContext) %>%
  summarize(mean.MAPE = scales::percent(mean(sale_price.APE, na.rm = T))) %>%
  st_drop_geometry() %>%
  # spread(incomeContext, mean.MAPE) %>%
  kbl(caption = "Table 8: Test set MAPE by neighborhood income context", align = rep("c", 8),
      col.names = c("Income Context", "Average MAPE")) %>%
    # footnote("70 rows of data were missing median household income and are therefore excluded from this table.") %>% 
    kable_classic(full_width = F, html_font = "Cambria") %>% 
  column_spec(1, width = "12em") %>% column_spec(2, width = "12em")
Table 8: Test set MAPE by neighborhood income context
Income Context Average MAPE
High Income 25%
Low Income 32%
Show the code
st_join(mape_data, split_data) %>% 
  group_by(raceContext) %>%
  summarize(mean.MAPE = scales::percent(mean(sale_price.APE, na.rm = T))) %>%
  st_drop_geometry() %>%
  kbl(caption = "Table 9: Test set MAPE by neighborhood race context", align = rep("c", 8),
      col.names = c("Race Context", "Average MAPE")) %>%
    # footnote("70 rows of data were missing median household income and are therefore excluded from this table.") %>% 
    kable_classic(full_width = F, html_font = "Cambria") %>% 
  column_spec(1, width = "12em") %>% column_spec(2, width = "12em")
Table 9: Test set MAPE by neighborhood race context
Race Context Average MAPE
Majority Non-White 35%
Majority White 22%
Show the code
# reg <- lm(sale_price ~ ., 
#                data = reg_data,
#           na.action = na.exclude)
# 
# resids_data <- cbind(reg$model, 
#                   reg$residuals, 
#                   reg$fitted.values,
#                   dummied_data$musa_id,
#                   dummied_data$geometry) %>%
#               mutate(resids = `reg$residuals`,
#                      pred_value = `reg$fitted.values`) %>%
#             st_sf()

resids_data <- st_join(resids_data, hoods)

error_x_hood <- resids_data %>%
                  group_by(mapname) %>%
                  summarize(avg_error = mean(abs(resids))) %>%
                  arrange(desc(avg_error)) %>%
                  select(avg_error, mapname) %>%
                  st_drop_geometry() %>%
                  left_join(., hoods, by = "mapname") %>%
                  st_sf()

tmap_mode('plot')

tm_shape(error_x_hood) +
  tm_polygons(col = 'avg_error', border.col = NA, style = 'jenks')

Show the code
# reg_data <- inner_join()
# reg_data$residuals <- reg$residuals
# 
# reg_data <- final_data %>%
#               transmute(
#                stand_resids = rstandard(reg),
#                      spatial_lag = st_lag(stand_resids, 
#                                              nb, 
#                                              weights)
#               )
Show the code
lisa = local_moran(data_nb$sale_price, data_nb$nb, data_nb$wt, nsim = 999)
data_nb <- cbind(data_nb, lisa)

lisa_sample <- sample_n(data_nb, 100)

tmap_mode('view')

tm_shape(lisa_sample) +
  tm_dots(col = "p_ii")

Discussion

Is this an effective model? What were some of the more interesting variables? How much of the variation in prices could you predict? Describe the more important features? Describe the error in your predictions? According to your maps, could you account the spatial variation in prices? Where did the model predict particularly well? Poorly? Why do you think this might be?

Accuracy and Generalizability

Conclusion

Would you recommend your model to Zillow? Why or why not? How might you improve this model?